Scalable Data Space Partitioning In High Dimension

Authors

  • Theodore Johnson
  • Tamraparni Dasu
Abstract

A fundamental process in data mining is to approximate the joint distribution of a multivariate data set. A typical approach is to use histograms. However, conventional histograms (with dimension-normal data partitions) create an exponential number of partitions as the number of dimensions increases. In previous work, we presented the DataSphere method for data space partitioning and used it for several types of analysis. A DataSphere creates O(d) partitions on d-dimensional data. In this paper, we generalize DataSphere partitioning to create any polynomial number O(d^k) of partitions, where k ranges from 1 to d, by using hyperpyramids and hyperspheres. Each of these partitioning schemes is hierarchical, allowing data cube-like roll-up and drill-down data analysis. We present several examples of how these partitioning schemes can be used, including visualization.
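The contrast between exponential and linear partition counts can be illustrated with a minimal sketch. The abstract does not say how a hyperpyramid is assigned, so the rule below (pick the coordinate with the largest absolute deviation from a center point, then its sign) is an assumption made only for illustration; the names `grid_cell_count` and `pyramid_index` are hypothetical and do not come from the paper.

```python
def grid_cell_count(bins_per_dim, d):
    """Cells in a conventional histogram with the same number of bins on every axis."""
    return bins_per_dim ** d


def pyramid_index(point, center):
    """Hypothetical assignment of a point to one of 2*d hyperpyramids.

    Assumption (not stated in the abstract): the pyramid is chosen by the
    coordinate with the largest absolute deviation from the center, and by
    the sign of that deviation.
    """
    deviations = [x - c for x, c in zip(point, center)]
    # Axis along which the point lies farthest from the center (first one on ties).
    axis = max(range(len(deviations)), key=lambda i: abs(deviations[i]))
    side = 1 if deviations[axis] >= 0 else 0
    return 2 * axis + side


if __name__ == "__main__":
    # Grid cells grow as 10**d, while the assumed pyramid count grows as 2*d.
    for d in (2, 5, 10, 20):
        print(d, grid_cell_count(10, d), 2 * d)
```

Under this assumed rule the number of directional partitions is 2d, which is linear in d, whereas a grid with 10 bins per axis already has 10^d cells.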

Similar resources

Efficient high dimension data clustering using constraint-partitioning k-means algorithm

With the ever-increasing size of data, clustering of large dimensional databases poses a demanding task that should satisfy both the requirements of the computation efficiency and result quality. In order to achieve both tasks, clustering of feature space rather than the original data space has received importance among the data mining researchers. Accordingly, we performed data clustering of h...

Effective Spatial Data Partitioning for Scalable Query Processing

Recently, MapReduce based spatial query systems have emerged as a cost effective and scalable solution to large scale spatial data processing and analytics. MapReduce based systems achieve massive scalability by partitioning the data and running query tasks on those partitions in parallel. Therefore, effective data partitioning is critical for task parallelization, load balancing, and directly ...

Low-Quality Dimension Reduction and High-Dimensional Approximate Nearest Neighbor

The approximate nearest neighbor problem (ε-ANN) in Euclidean settings is a fundamental question, which has been addressed by two main approaches: Data-dependent space partitioning techniques perform well when the dimension is relatively low, but are affected by the curse of dimensionality. On the other hand, locality sensitive hashing has polynomial dependence in the dimension, sublinear query...

A Class of Region-preserving Space Transformations for Indexing High-dimensional Data

This study introduces a class of region preserving space transformation (RPST) schemes for accessing high-dimensional data. The access methods in this class differ with respect to their space-partitioning strategies. The study develops two new static partitioning schemes that can split each dimension of the space within linear space complexity. They also support an effective mechanism for handli...

Interactive Rendering of Volumetric Data Sets

The bela architecture for interactive rendering of regularly structured volumetric data sets is presented. The proposed architecture is scalable and uses custom processors to achieve high-speed shading, projection, and composition of voxel primitives. A general purpose image composition network supports the accumulation of both volumetric and geometric elements into the final rendered scene. Da...

Publication date: 1999